In the next two hours we are going to explore some of the ways unstructured data can be plotted using R. The list of plots is by no means exhaustive; it is just a short overview of possible options. A very good starting point for exploring graphs remains https://r-graph-gallery.com/
Let’s say that you are exploring the results of OCR’d documents and you want to check where the low OCR confidence comes from and whether it clusters around certain areas of the page. Let’s see the original file:
image1 <- image_read("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/page10.jpg")
image1
info <- image_info(image1)
print(info)
## # A tibble: 1 × 7
## format width height colorspace matte filesize density
## <chr> <int> <int> <chr> <lgl> <int> <chr>
## 1 JPEG 2481 3508 Gray FALSE 1120315 300x300
OK, this seems to be quite a well-defined image. Let’s see the results of the OCR on it:
PrideAndPrejudiceCh10<- read_csv("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/OCRPrideAndPrejudice.csv")
PrideAndPrejudiceCh10
## # A tibble: 280 × 8
## word confidence Minx Miny Maxx Maxy MeanX Meany
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 PRIDE 95.8 743 162 938 202 840 182
## 2 AND 96.0 982 162 1119 201 1050 182
## 3 PREJUDICE. 90.5 1165 160 1534 215 1350 188
## 4 3 96.5 2027 163 2055 218 2041 190
## 5 “What 95.2 287 292 513 343 400 318
## 6 is 96.0 542 291 587 343 564 317
## 7 his 96.9 617 290 704 342 660 316
## 8 name?” 92.1 735 289 984 343 860 316
## 9 “ 91.2 288 387 314 409 301 398
## 10 Bingley.” 95.8 337 381 627 451 482 416
## # ℹ 270 more rows
str(PrideAndPrejudiceCh10)
## spc_tbl_ [280 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ word : chr [1:280] "PRIDE" "AND" "PREJUDICE." "3" ...
## $ confidence: num [1:280] 95.8 96 90.5 96.5 95.2 ...
## $ Minx : num [1:280] 743 982 1165 2027 287 ...
## $ Miny : num [1:280] 162 162 160 163 292 291 290 289 387 381 ...
## $ Maxx : num [1:280] 938 1119 1534 2055 513 ...
## $ Maxy : num [1:280] 202 201 215 218 343 343 342 343 409 451 ...
## $ MeanX : num [1:280] 840 1050 1350 2041 400 ...
## $ Meany : num [1:280] 182 182 188 190 318 317 316 316 398 416 ...
## - attr(*, "spec")=
## .. cols(
## .. word = col_character(),
## .. confidence = col_double(),
## .. Minx = col_double(),
## .. Miny = col_double(),
## .. Maxx = col_double(),
## .. Maxy = col_double(),
## .. MeanX = col_double(),
## .. Meany = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
mean(PrideAndPrejudiceCh10$confidence)
## [1] 94.59075
I can see where the confidence is lower by plotting the OCR confidence level of each word on top of the original page:
ggplot(PrideAndPrejudiceCh10, aes(MeanX, Meany, colour= confidence)) + #Colour code by confidence
background_image(image1)+ # Select the image I want as a background
geom_point(shape=15, size=5, alpha=0.5)+#plot them as squared points
coord_cartesian(xlim = c(0, 2481),ylim = c(3508, 0),
expand = FALSE)+ #Use the sizes of the image to scale the graph
scale_colour_continuous(low="red", high="green") + #Create a continuous scale from green to red to colour-code the results
theme_bw()# b/w background
Let’s now repeat the exercise with a second page. First I import the image:
image2 <- image_read("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/page1.jpg")
image2
info <- image_info(image2)
print(info)
## # A tibble: 1 × 7
## format width height colorspace matte filesize density
## <chr> <int> <int> <chr> <lgl> <int> <chr>
## 1 JPEG 3307 4677 Gray FALSE 4573993 400x400
OK, this seems to be quite a well-defined image. Let’s see the results of the OCR on it.
Again I am importing them from GitHub:
PrideAndPrejudiceIncipit<- read_csv("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/PrideAndPrejudiceIncipit.csv")
PrideAndPrejudiceIncipit
## # A tibble: 518 × 8
## word confidence Minx Miny Maxx Maxy MeanX Meany
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Rts. 3.53 2296 610 2430 655 2363 632
## 2 EX. 86.2 1953 591 2073 697 2013 644
## 3 “he 51.9 2243 638 2361 813 2302 726
## 4 * 82.2 2338 742 2362 763 2350 752
## 5 ioe 32.9 2421 759 2589 832 2505 796
## 6 2 30.3 1178 658 1224 663 1201 660
## 7 YWD 1.06 1398 629 1972 920 1685 774
## 8 2 50.1 2043 678 2146 838 2094 758
## 9 Ws 44.4 2298 730 2613 924 2456 827
## 10 9 91.0 2061 985 2130 1049 2096 1017
## # ℹ 508 more rows
str(PrideAndPrejudiceIncipit)
## spc_tbl_ [518 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ word : chr [1:518] "Rts." "EX." "“he" "*" ...
## $ confidence: num [1:518] 3.53 86.23 51.92 82.25 32.92 ...
## $ Minx : num [1:518] 2296 1953 2243 2338 2421 ...
## $ Miny : num [1:518] 610 591 638 742 759 658 629 678 730 985 ...
## $ Maxx : num [1:518] 2430 2073 2361 2362 2589 ...
## $ Maxy : num [1:518] 655 697 813 763 832 ...
## $ MeanX : num [1:518] 2363 2013 2302 2350 2505 ...
## $ Meany : num [1:518] 632 644 726 752 796 ...
## - attr(*, "spec")=
## .. cols(
## .. word = col_character(),
## .. confidence = col_double(),
## .. Minx = col_double(),
## .. Miny = col_double(),
## .. Maxx = col_double(),
## .. Maxy = col_double(),
## .. MeanX = col_double(),
## .. Meany = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
mean(PrideAndPrejudiceIncipit$confidence)
## [1] 42.67584
That is not good at all, but can we see where the issues are through data visualisation?
We can do so by plotting it yet again:
ggplot(PrideAndPrejudiceIncipit, aes(MeanX, Meany, colour= confidence)) + #Colour code by confidence
background_image(image2)+ # Select the image I want as a background
geom_point(shape=15, size=5, alpha=0.5)+#plot them as squared points
coord_cartesian(xlim = c(0, 3307),ylim = c(4677, 0),
expand = FALSE)+ #Use the sizes of the image to scale the graph
scale_colour_continuous(low="red", high="green") + #Create a continuous scale from green to red to colour-code the results
theme_bw()# b/w background
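The plot shows visually where the low-confidence words sit. As a rough numerical cross-check, the page can be split into regions and the mean confidence computed for each. The sketch below uses a small made-up table with the same column names as the OCR CSV; the values, the page size and the 2 x 2 grid are all assumptions for illustration, not taken from the real files.

```r
# Toy stand-in for the OCR table (hypothetical values, same column names as the CSV)
ocr <- data.frame(
  word       = c("PRIDE", "AND", "Rts.", "YWD", "name", "ioe"),
  confidence = c(95.8, 96.0, 3.5, 1.1, 92.1, 32.9),
  MeanX      = c(840, 1050, 2363, 1685, 860, 2505),
  Meany      = c(182, 182, 632, 774, 316, 796)
)

page_width  <- 3000  # assumed page size in pixels
page_height <- 1000

# Split the page into a 2 x 2 grid of regions
ocr$col_band <- cut(ocr$MeanX, breaks = c(0, page_width / 2, page_width),
                    labels = c("left", "right"))
ocr$row_band <- cut(ocr$Meany, breaks = c(0, page_height / 2, page_height),
                    labels = c("top", "bottom"))

# Mean confidence per region (empty regions are dropped)
region_means <- aggregate(confidence ~ row_band + col_band, data = ocr, FUN = mean)
region_means

# Words below an arbitrary confidence threshold of 50
low_conf <- ocr[ocr$confidence < 50, ]
low_conf$word
```

With the real data you would replace `ocr` with the imported tibble and use the width and height reported by `image_info()`.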
This is what you can do with OCR data. Let’s now look at proper textual data.
Before being able to plot unstructured data, there is some cleaning and setting up that is always needed.
So let’s go through it step by step.
First we import the data, yet again from GitHub:
ScotAccount<- read_csv("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/ScotlandParishes.csv")
Then we inspect the data by running summary():
summary(ScotAccount)
## title text Type TypeDescriptive
## Length:27065 Length:27065 Length:27065 Length:27065
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## RecordID Area Parish Year
## Length:27065 Length:27065 Length:27065 Length:27065
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Tables
## Length:27065
## Class :character
## Mode :character
This is a very large dataset.
The ‘Old’ Statistical Account (1791-99), under the direction of Sir John Sinclair of Ulbster, and the ‘New’ Statistical Account (1834-45) are reports of life in Scotland during those years.
They offer uniquely rich and detailed parish reports for the whole of Scotland, covering a vast range of topics including agriculture, education, trades, religion and social customs.
https://stataccscot.edina.ac.uk/static/statacc/dist/home
They cover everything from changing fashions in dress to the different attitudes to smallpox inoculation, and the resulting high infant mortality, between the north and south of Scotland.
Our dataset contains 27065 records, each corresponding to a single report from the Statistical Accounts about a certain parish.
A quanteda corpus is a special type of data structure used to represent a collection of texts. It is designed to facilitate the analysis of textual data in a quantitative manner. The quanteda package provides functions for creating, manipulating, and analyzing corpora.
If you do not specify a column, corpus() will look for a variable named text. If you do not have one, you need to specify the name of the column that contains the text you want to analyse.
ScotAccountCorpus<-corpus(ScotAccount)
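For a data frame whose text column has a different name, the column can be passed through the `text_field` argument of `corpus()`. This is a minimal sketch on a made-up data frame; `reports`, `body` and the parish names are hypothetical and are not part of the workshop data.

```r
library(quanteda)

# Hypothetical data frame whose text column is NOT called "text"
reports <- data.frame(
  Parish = c("Alpha", "Beta"),
  body   = c("The church stands near the river.",
             "Oats and barley are the chief crops."),
  stringsAsFactors = FALSE
)

# text_field tells corpus() which column holds the documents;
# the remaining columns become document-level variables (docvars)
reports_corpus <- corpus(reports, text_field = "body")

ndoc(reports_corpus)     # number of documents
docvars(reports_corpus)  # metadata carried along with each document
```

The document-level variables are what later make it possible to subset or group by a column such as Area.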
“Tokenize” is a term used in natural language processing (NLP) and text analysis to refer to the process of breaking down a text into individual units, which are often referred to as “tokens.” Tokens can be words, subwords, or other units of text, depending on the level of granularity desired for analysis.
Word tokenization: breaking a text into individual words. Example: “The quick brown fox” would be tokenized into [“The”, “quick”, “brown”, “fox”].
Sentence tokenization: breaking a text into individual sentences. Example: “This is the first sentence. And this is the second.” would be tokenized into [“This is the first sentence.”, “And this is the second.”].
Tokenization is a fundamental step in many natural language processing tasks because it helps to convert raw text into a format that can be easily processed and analyzed by algorithms. Once tokenized, the resulting units can be used for tasks such as counting word frequencies, training machine learning models, or extracting linguistic features for further analysis.
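The two kinds of tokenization can be tried on the example sentences themselves. The sketch below uses quanteda's `tokens()` function with its `what` argument; the object names are just for illustration.

```r
library(quanteda)

# Word tokenization (the default, what = "word")
word_toks <- tokens("The quick brown fox")
as.character(word_toks)

# Sentence tokenization
sent_toks <- tokens("This is the first sentence. And this is the second.",
                    what = "sentence")
as.character(sent_toks)

# Number of tokens in each case
ntoken(word_toks)
ntoken(sent_toks)
```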
In a document-feature matrix (dfm), each row corresponds to a document, and each column corresponds to a feature (typically a term or word). The matrix entries represent the frequency of each feature in each document. The term “feature” is used more broadly than “term” because the features can include not only individual words but also phrases or other linguistic units.
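To make the row and column structure concrete, here is a small sketch that builds a dfm from two made-up documents and then trims it by term frequency. The texts and the threshold of 2 are assumptions chosen only to keep the example small.

```r
library(quanteda)

# Two tiny hypothetical documents
toy_dfm <- tokens(c(doc1 = "church school church", doc2 = "school farm")) |>
  dfm()

toy_dfm             # rows = documents, columns = features
as.matrix(toy_dfm)  # the same counts as an ordinary matrix

# Keep only features that occur at least twice across all documents
toy_trimmed <- dfm_trim(toy_dfm, min_termfreq = 2)
featnames(toy_trimmed)
```

Here "farm" occurs only once in total, so it is dropped by `dfm_trim()`, while "church" and "school" are kept.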
Let’s do these steps on our dataset
dfmat_ScotAccountCorpus <- corpus_subset(ScotAccountCorpus) |> # which data
tokens(remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols=TRUE) |> # Tokenise
tokens_select(min_nchar = 3)|> # Refine tokens by removing words shorter than 3 characters
tokens_remove(pattern = c(stopwords('english'), "Parish")) |> # Remove common + custom stopwords
dfm()|> # create a Document Feature Matrix
dfm_trim(min_termfreq = 2000, verbose = FALSE) # keep only words that occur at least 2000 times; verbose = FALSE suppresses the console message for this step
Stop words are common words that are often filtered out during the preprocessing of natural language text data before analysis. These words are generally the most common words in a language and are considered to carry little or no meaningful information about the content of the text. Including stop words in text analysis can introduce noise and may not contribute much to the understanding of the underlying patterns.
Common examples of stop words in English include words like “the,” “and,” “is,” “in,” “to,” and so on. The specific list of stop words can vary depending on the context and the task at hand.
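The effect of removing stop words is easiest to see on a single sentence. This sketch uses quanteda's `stopwords()` list together with `tokens_remove()`; the sentence and the extra custom stop word are made up for illustration.

```r
library(quanteda)

toks <- tokens("The fox is in the garden", remove_punct = TRUE)

# Remove the standard English stop words (matching is case-insensitive by default)
kept <- tokens_remove(toks, pattern = stopwords("english"))
as.character(kept)

# Custom stop words can simply be appended to the list
kept_custom <- tokens_remove(toks, pattern = c(stopwords("english"), "garden"))
as.character(kept_custom)
```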
Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud.
Let’s start from a simple one
set.seed(100) # to create reproducible results when writing code
textplot_wordcloud(dfmat_ScotAccountCorpus) # Plot our wordcloud
If we want to compare across specific areas of Scotland, I first need to process the corpus again, this time looking only at data from Fife, Edinburgh, Haddington and Stirling.
CorpusSubset<-corpus_subset(ScotAccountCorpus, # which data
Area %in% c("Fife", "Edinburgh", "Haddington", "Stirling")) |> # select only some data from the Area column, only those from Fife, Edinburgh, Haddington and Stirling
tokens(remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols=TRUE) |> # tokenise
tokens_select(min_nchar = 3)|> # Refine tokens by removing words shorter than 3 characters
tokens_remove(pattern = c(stopwords('english'), "Parish", "Edinburgh", "Fife", "Haddington", "Stirling")) |> # remove English stopwords plus custom ones
dfm() |> # create a Document Feature Matrix
dfm_group(groups = Area) |> # group the documents by Area
dfm_trim(min_termfreq =500, verbose = FALSE) # keep only words that occur at least 500 times; verbose = FALSE suppresses the console message for this step
set.seed(100) # to create reproducible results when writing code
textplot_wordcloud(CorpusSubset,comparison = TRUE) # comparison = TRUE divides the cloud by area; NB the maximum number of groups is 8
Let’s refine the colours
textplot_wordcloud(CorpusSubset,comparison = TRUE,
color = c("red","purple", "green", "blue"),# Select colours
max_words = 100, # max number of words displayed
min_size = 0.8,# min size font
max_size = 5)# max size font
We can also be fancy and fit the word cloud into the shape of Scotland.
To do so we need to use the ggwordcloud library. This works with data frames, so we need to transform our dfm into a data frame containing the frequency of words. We can do this by extracting the frequencies (you will need quanteda.textstats for this as well):
tstat_freq_Stats <- textstat_frequency(dfmat_ScotAccountCorpus, n = 75) # extracting the top 75 words
Then we get the image that we want to use, in this case the shape of Scotland:
# First we get the image
image_url <- "https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/Scotland.png"
# Download the image to a local file
download.file(url = image_url, destfile = "scotland.png", mode = "wb")
set.seed(42)# Again this is to make it reproducible
# We can look at it
Scotland<-image_read("scotland.png")
Scotland
# Finally we define the image path
image_path <- "scotland.png" # path to the file downloaded above (same name and case); it needs to be a PNG to handle transparency
ggplot(tstat_freq_Stats, aes(label = feature, size = frequency, colour = frequency)) +
geom_text_wordcloud_area(
mask = png::readPNG(image_path),
rm_outside = TRUE # This should remove text outside the mask area
) +
scale_size_area(max_size =35) +
scale_color_gradient(low = "darkblue", high = "blue") +
theme_minimal()
We extracted the frequencies; now let’s look at other ways to plot them (i.e. what are the most recurrent words in a series of texts).
ggplot(tstat_freq_Stats, aes(x = frequency, y = reorder(feature, frequency))) +
geom_point() +
labs(x = "Frequency", y = "Feature")+
theme_bw()
If you wanted to compare the frequency of a single term across different texts, you can also use textstat_frequency, group the frequency by Area and extract the term.
First we re-tokenise, but without creating a dfm:
toks_corpus_Scot_subset <-
corpus_subset(ScotAccountCorpus) |> # Ceci n'est pas une pipe...
tokens()
freq_grouped <- textstat_frequency(dfm(toks_corpus_Scot_subset),
groups = Area)
freq_church <- subset(freq_grouped, freq_grouped$feature %in% "church")
ggplot(freq_church, aes(x = frequency, y = group)) +
geom_point() +
scale_x_continuous(limits = c(0, 1750), breaks = c(seq(0, 1750, 150))) +
labs(x = "Frequency", y = NULL,
title = 'Frequency of "Church"')+
theme_bw()
For this one we need to use a different dataset because the one we were working on is too big.
So we are using the inaugural speeches of the USA presidents.
This is a built-in dataset that can be used directly:
toks_corpus_inaugural_subset <-
corpus_subset(data_corpus_inaugural, Year > 1949) |>
tokens()
The term “keyword in context” (KWIC) refers to a method of displaying specific words or phrases in their surrounding context, typically within a larger body of text. This technique is commonly used in linguistics, information retrieval, and text analysis to better understand how certain words are used and the context in which they appear.
In a KWIC display, a specific word or phrase (the “keyword”) is centered in the middle of the display, and the text snippets or lines containing that keyword are shown in context, usually to the left and right of the keyword. This provides a quick and concise way to observe the usage patterns and surrounding words of a particular term.
kwic(toks_corpus_inaugural_subset, pattern = "american") |>
textplot_xray()
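Before it is plotted, the result of `kwic()` is a table with one row per match, holding the position of the keyword and the words to its left and right. A small sketch on a made-up sentence shows what that table contains; the text and the window size of 2 are assumptions for illustration.

```r
library(quanteda)

# One tiny hypothetical document
toy_toks <- tokens(c(speech = "Every American dream belongs to the American people"))

# Find "american" (case-insensitive by default) with two words of context on each side
hits <- kwic(toy_toks, pattern = "american", window = 2)
hits

nrow(hits)  # number of matches
hits$from   # token position where each match starts
```

`textplot_xray()` uses these positions to draw one mark per match along the length of each document.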
For the last bit we go back to our original dataset and check how big it is.
It is always useful to extract information about the corpus we are working with. Some methods for extracting information about the corpus:
# to explore this we focus on the text column of the ScotAccount
CorpusStat<-corpus(ScotAccount$text)
# Summarise the first 5 documents of the corpus
summary(CorpusStat, 5)
## Corpus consisting of 27065 documents, showing 5 documents:
##
## Text Types Tokens Sentences
## text1 94 171 1
## text2 167 399 1
## text3 122 201 1
## text4 123 262 1
## text5 146 273 1
# Check how many docs are in the corpus
ndoc(CorpusStat)
## [1] 27065
# Check number of characters in the first 10 documents of the corpus
nchar(CorpusStat[1:10])
## text1 text2 text3 text4 text5 text6 text7 text8 text9 text10
## 940 1906 1011 1245 1354 1422 1995 1916 1855 1977
# Check number of tokens in the first 10 documents
ntoken(CorpusStat[1:10])
## text1 text2 text3 text4 text5 text6 text7 text8 text9 text10
## 171 399 201 262 273 281 422 387 365 454
Can we create some better visualisation to look into it?
Yes of course.
Create a new vector with the number of tokens for each report and store it in a new data frame with four columns (Tokens, title, Area, Parish).
NtokenStats<-as.vector(ntoken(CorpusStat))
TokenScotland <-data.frame(Tokens=NtokenStats, title=ScotAccount$title, Area=ScotAccount$Area, Parish=ScotAccount$Parish)
Now we want to see how much material we have for each area. We can do that through pipes
BreakoutScotland<- TokenScotland |>
group_by(Area)|>
summarize(NReports=n(), MeanTokens=round(mean(Tokens)))
Now we can plot the trends.
This is done with the ggplot2 package, a very handy package that lets you produce a wide variety of graphs.
ggplot(BreakoutScotland, aes(x=Area, y=NReports))+ # Select data set and coordinates we are going to plot
geom_point(aes(size=MeanTokens, fill=MeanTokens),shape=21, stroke=1.5, alpha=0.9, colour="black")+ # Which graph I want
labs(x = "Areas", y = "Number of Reports", fill = "Mean of Tokens", size="Mean of Tokens", title="Number of Reports and Tokens in the Scotland Archive")+ # Rename labs and title
scale_size_continuous(range = c(5, 15))+ # Resize the dots to be bigger
geom_text(aes(label=MeanTokens))+ # Add the mean of tokens in the dots
scale_fill_viridis_c(option = "plasma")+ # Change the colour coding
theme_bw()+ # B/W Background
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "bottom")+ # Rotate labels of x and move them slightly down. Plus move the position to the bottom
guides(size = "none") # Remove the Size from the Legend